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Abstract 

Logistic models are studied as a tool to convert output from numerical 
weather forecasting systems (deterministic and ensemble) into probability 
forecasts for binary events. A logistic model obtains by putting the log- 
arithmic odds ratio equal to a linear combination of the inputs. As any 
statistical model, logistic models will suffer from over-fitting if the number 
of inputs is comparable to the number of forecast instances. Computa- 
tional approaches to avoid over-fitting by regularisation are discussed, and 
efficient approaches for model assessment and selection are presented. A 
logit version of the so called lasso, which is originally a linear tool, is dis- 
cussed. In lasso models, less important inputs are identified and discarded, 
thereby providing an efficient and automatic model reduction procedure. 
For this reason, lasso models are particularly appealing for diagnostic 
purposes. 



1 Introduction 

Providing forecasts of future events in terms of probabilities has a long and 
successful history in the environmental sciences. The inherently unstable dy- 
namics of the atmosphere in conjunction with incomplete information on its 
current state prohibit unequivocal forecasts. Probabilities allow to quantify 
uncertainty (or the absence thereof, i.e. information) in a well defined and 
consistent manner. So called "subjective probability forecasts", compiled by 
experienced meteorologists, were issued by several meteorological offices from 
the 1950's onwards. These forecasts were based on synoptic weather data col- 
lected from a large number of stations. On a scientific (non-operational) level, 
probabilistic weather forecasts were discussed much earl ier, either based on syn- 
optic information as well as local station data (see e.g. Bessonl Il905t llVlurphv . 



19981 . provides an excellent account on the history of probability forecasting 
of weather, along with many references). There is evidence that subjective 
probabilistic weather fo recasts were co mpiled from non-synoptic information 



as early as the 1790's (jMurphvl . Il998h . Desirable properties of probabilistic 



forecasts as well as methods to quantify their succe s s have been i nvestigated i n 
various papers, see for example I von Mvrbach (|l913l) : lBrierj (Il950h : iGoodl (|l952l) : 
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Winkler and Murphvi (|l968h ; lMurphv and Winkled (|l977lll987h ; lMurph^l (|l993l 
1996), a list which is by no means complete. 

The advent of the electronic computer opened the possibility to calculate 
"numerical subjective probabilities", that is, to calculate probabilities from in- 
formation data using tuned algorithms. More specifically, an automated sta- 
tistical learning procedure can be employed to find a relationship (also called 
model) between the information data (also called covariates, features, or inputs) 
and the statistical properties of the variable to be forecast (also called target, or 
verification). Possible inputs might for example be provided by output from a 
numerical weather prediction system, in which case the problem is also referred 
to as ensemble calibration or probabilistic down-scaling. A model (or more 
specifically model class) which has gained some att ention in the meteor o logical 
comm u nity is the logistic model, see for example Tippet et al. ( 2007 ); Wilkd 
( 20061 ) ; lHamill et alj ( 2001 ) and also references therein for various alternatives. 
Logistic models, often also referred to as logistic regression, will be the subject 
of this paper. We will exclusively be concerned with dichotomic problems, that 
is, we are only interested in forecasting whether a certain event happens or not. 
In this case, the logistic model obtains by taking the logarithmic odds ratio 
l°g(i^p) °f the forecast probability p of the event as a linear function of the 
inputs. In other words, 



exp(x/3*) 
1 + exp(x/3') ' 



(1) 



where x are the inputs and (3 some coefficients. The maximum likelihood prin- 
ciple provides a convenient way to find the coefficients, but alternatives will be 
considered in this paper, too. In any case, the coefficients are found by optimis- 
ing the performance over some training data. As with other regression models 
though, this approach runs into problems if the number of inputs is of the same 
order of magnitude as the number of instances in the training data, in which 
case the inputs are typically also highly correlated. This is a common situation 
in weather forecasting, owing to the large number of available forecast sources. 
One way to avoid over-fitting in this situation is to manually restrict the number 
of inputs to the few that seem to be most relevant. This was carried out for 
example by iBessonl (|l905l ). but often such a study would require to assess an 
astronomically large amount of different combinations. 

Another way is to apply regularisation, which means to reduce the effective 
degrees of freedom of the model. Efficient regularisation techniques exist for 
linear models. Owing to the great similarity between linear and logistic mod- 
els, these techniques can be modified and applied to logistic models, as will 
be demonstrated in this paper. Logistic models will be defined in Section [5] 
Section [3] discusses how to regularise logistic models along with further com- 
putational and implementational aspects. Sections [5] and [3] therefore form the 
core of the paper. Before getting there, some notation and concepts need to be 
introduced (Section [2]). Numerical studies employing precipitation forecasts are 
presented in Section HI Apart from the predictive power of logistic models, they 
also permit to investigate the relative importance of the inputs. The material 
of this last section thus also demonstrates the capabilities of logistic models as 
a diagnostic tool. 
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2 Problem Statement and Concepts 



The primary aim of this section is to settle notational conventions and introduce 
some concepts. The general setup we have in mind is as follows. As in the 
introduction, let the target Y be a random variable taking the values and 1 
only, with Y = 1 indicating that the event under concern happened and Y = 
otherwise. The inputs X are random variables too, taking values in R d . By 
x, we will denote any generic point in M. d , while y is always either zero or one. 
The underlying probability measure will be denoted by P. The probabilistic 
relationship between X and Y is described through the following objects. Let 

f (x) :=F(X ex + dx\Y = 0), 

fi{x) := F(X ex + dx\Y = 1), ^ ' 

that is, fo and ft, respectively, are the densities of X given Y = and Y = 1, 
respectively. By 

n(x) :=¥(Y = 1\X = x) (3) 
we denote the conditional probability of the event "Y = 1" given X, and 

7f := F(Y = 1) (4) 

denotes the base rate or grand probability of the event "Y = 1". Finally, 

f(x) := F(X e x + dx) (5) 

denotes the unconditional density of the inputs X. The Bayes rule entails 
various relations between these objects, for example f(x) = ft(x)w+fo(x)(l— 7f). 

A model is a function p : M. d — » [0, 1] so that p(X) is the forecast probability 
of the event "Y = 1" . Generally speaking, the problem discussed in this paper 
is to find "good" models, where it remains to be defined what "good" means. 

Intuitively, it is clear that p(x) :— ir(x) is a good model. Unfortunately, tt(x) 
is not an empirically measurable quantity, and therefore "interpolating" or "fit- 
ting" 7r(x) is not a possible approach to determine p(x). A general operational 
criterion to fit (deterministic or probabilistic) relationships between measured 
quantities is to optimise the estimated performance, according to suitable cri- 
teria of performance. Such criteria are subject of the next subsection. 



2.1 Scoring Rules 

A scoring rule (|Goodl . Il952t iKellvl . Il956t iBrownl . Il970l : ISavagel Il97ll ) is a func- 
tion S(p,y) where p € [0, 1] and y is either zero or one. If p(X) is the forecast 
probability and Y is the corresponding target, then S(p(X),Y) quantifies how 
well p{X) s ucceeded in f orecasting Y . Two important examples are the Igno- 
rance score flGoodl . ll952h . given by the scoring rule 



S(p, y) ■■= - log(p) • y - log(l -p) ■ (1 - y), 
and the Brier score <|Brierl . ll95(Jl . given by the scoring rule 

S{p, y) ■= (y -p) 2 = (l- P ) 2 -y + p 2 -(l-y). 



(6) 



(7) 



These definitions imply the convention that a smaller score indicates a better 
forecast. 
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A score or scoring rule quantifies the success of individual forecast instances 
by comparing the random variables p(X) and Y point-wise. The general quality 
of a forecasting system is commonly measured by the mathematical expectation 
E [S(p(X),Y)] of the score, which can be estimated by the empirical mean 



1 N 

E[S(p(X),Y)} = -J2s( P (x,),y t ) (8) 



over a sufficiently large set {(xi,yi); i — 1 . . . N} of input-target pairs. 

Reassuringly, for the two mentioned scoring rules the score becomes better 
(i. e. decreases) with increasing p if the event happened, while if it does not, 
the score becomes worse (i.e. increases) with increasing p. Furthermore, both 
scores are proper. To define this notion, consider the scoring function 

s(q,p) :=S(q,l)-p + S(q,0)-(l-p) (9) 

where q,p are two arbitrary probabilities, that is, numbers in the unit interval. 
The scoring function is the mathematical expectation of the score in a situation 
where the forecast is q but in f act p is the true probability of th e even t "Y — 
1" . A score is strictly proper (|Brownl . Il970t iBrocker and Smith! . l2007h if the 



divergence function (or loss function) 

d(q,p) := s(q,p) ~ s(p,p) (10) 

is positive definite, that is, never negative and zero only if p = q. The divergence 
function of the Brier score for example is d(q,p) := (q — p) 2 , demonstrating that 
this score is strictly proper. The Ignorance is proper as well, since ()10p is just 
the Kullback-Leibler-divergence, which is well known to be positive definite. 

The mathematical expectat ion of a strictl y proper score allows for a very 
interesting decomposition (see Brockeii 120071 for a proof). For any strictly 



proper scoring rule, define the entropy e(p) := s(p,p). Furthermore let 7r p (r) := = 
P(Y = l\p(X) — r) be the conditional probability of Y = 1 given that p(X) = r. 
This quantity is a function of p, but it is a fully calibrated probability forecast. 
With these definitions, it can be shown that 

ES(p, Y) = e(7f) - Ed(ff , tt) + Ed(TT p , tt) + Ed(p, ir p ). (11) 

These terms can be interpreted as follows: The entropy e(rr) is the ability of 
the base rate tt to forecast draws from itself, and hence quantifies the funda- 
mental uncertainty inherent in Y. The term Ed(7f , tt) is positive definite and 
quantifies the average divergence of tt from its mean. It can hence be considered 
a generalised variance of tt. If the Brier score is used, this term is in fact the 
ordinary variance of tt. The term Ed(TT pi tt) is also positive definite and quanti- 
fies how much information is lost when going over from X to p(X). The term 
Eg?(/5, TT p ) is again positive definite and quantifies the imperfect calibration of p. 
The reader might want to convince himself that if the Brier score is used and 
furthermore X = p{X) (i.e. the inputs already comprise a probability forecast), 
then relation (fTTj) agrees with the well known decomposition of the Brier score. 
In particular, if p(x) = tt{x), then also tt p = tt(x) and hence the third and fourth 
term in Equation (fTTj) vanish. We can conclude that the forecast tt(X) in fact 
yields an optimum expected score among all models which can be written as a 
function of X. To achieve yet better scores, more information about Y is needed 
than what is contained in X. 
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2.2 Logistic Regression 



Comp r ehensive discussions of Logistic regression can be found in lMcCullagh and Nelder 



( 19891 ): Hastie et al. ( 2001 ). As mentioned in the Introduction, logistic regres- 



sion assumes a model of the form p(x) 
and 

9{») 



g(xP t ), where (3 are the coefficients 



exp(z) 



1 + exp(z) 
The quantity g [z) 



(12) 



the so-called link function. The quantity g~ L (z) = log(z/(l — z)) is referred 
to as the log-odds-ratio. We will call -q := x(3 t the linear response. From now 
on and throughout the paper, we will assume that the inputs x carry an entry 
1 in the first position and that (3 — (/3o . . . (3d) where (3q is referred to as the 
intercept. Since g~ 1 (p) = r), in logistic models, the log-odds-ratio equals the 
linear response. The coefficients (3 are determined by minimising the empirical 
score. Locally around the optimum, this minimisation turns out to be equivalent 
to weighted linear regression, as will be seen in the next subsection. Thereby, 
logistic models inherit various useful properties from linear models, as long as 
strictly proper scores are used in the empirical score. This fact will be exploited 
in the next section. 



3 Computational Topics and Regularisation of 
Logistic Models 



Consider the empirical score of a logistic model 

1 N 

fc=i 

where as before S is a scoring rule, g is the link function, and {(xk,Uk),k — 
1 . . . TV} is a set of input-target pairs, henceforth called training set. We let R 
denote the minimum of R with respect to (3, and (3 a corresponding stationary 
point. To find a stationary point of R, the Newton-Raphson algorithm can be 
used. The update step for this iterative algorithm can be written as 

, -l . 



PL w =P t -(*VM) L X*d (14) 



with the abbreviations 



X H := 4° ( 15 ) 

:= ^S{g{r, k ),y k ) (16) 
d 2 

Wk := g^ S (9(Vk),yk) (17) 

W fci := S k iw k , (18) 

where ij k := x k [3 l is the log-odds-ratio for sample k. In the case of the Igno- 
rance, it is easy to see that 

dfc = g{vk) - Vk, wfe = g(vk){i - g(vk)) (19) 

Equation (|14p is in fact very sim ilar to weighted least squares regression with 
linear models ( Hastie et all 120011 ) . 
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3.1 L 2 — Type Regularisation 

Whichever score or minimisation algorithm we employ, the coefficients (3 so 
determined will show poor out-of-sample performance if the number of degrees 
of freedom is of the same order of magnitude as the number of instances in the 
training set. Small changes in the training set will entail large changes in the 
coefficients, or in other words, the coefficients will exhibit large va riance. Apart 



from poor performance, the results become difficult to interprete (jHastie et al 



20011 ). To give a heuristic argument as to why this happens, suppose we want 
to fit a linear model to real valued data, and there are two highly correlated 
inputs. Any large value for (3\ (the coefficient corresponding to the first input) 
can be compensated by a large value (with opposing sign) for f3i (the coefficient 
corresponding to the second input). The algorithm will use this freedom to 
"model" the random fluctuations in the inputs. 

To avoid this behaviour, the degrees of freedom of the model have to be 
reduced (or, what amounts to the same, the range of variation of (3) , preferably 
in an adaptive manner. A straight forward approach is to search for a minimum 
of R only among all (3 with \(3\ 2 < fi with a regularisation parameter [i. Here 
\0\ 2 = J2k=iPl- The reader is reminded of the convention that (3 has d + 1 
entries in total, but we decided not to put any constraint on the intercept (3q. 
Furthermore, we assume that all inputs are centred and scaled so that they 
have mean zero and unit variance. To see why regularisation has the desired 
effect, and to obtain criteria for choosing an appropriate fi, the problem has to 
be brought into another form. For (3 to be a stationary point of R under the 
constraint \(3\ 2 < /i, it is necessary that there is a A so that the Lagrangian 

L(J3,\) :=#(/?) + A(|/?| 2 -m) 

has a saddle at (/?, A), which obviously entails that (3 is a stationary point for 

R x ((3):=R(f3)+X\f3\ 2 . (20) 

We arrive at the conclusion 

If (3 maximises R{(3) under the constraint \(3\ 2 < /i and if A is the 
corresponding Lagrange multiplier, then (3 is a stationary point of 
R~ x ((3). Conversely, if we fix a A > and let (3 be a stationary point of 
R x (f3), then (3 maximises R{(3) under the constraint \[3\ 2 < ^ := \f3\, 
and the corresponding Lagrange multiplier is A. 

It is probably more intuitive to optimise R{(3) under a size constraint on /? rather 
than to add a penalty term to the empirical score, as the former criterion makes 
the constraint on the coefficients more explicit. Augmenting the empirical score 
by a penalty term though (as in Equation I20| has computational advantages, 
in particular to establish a criterion for choosing the penalty A, as will be seen 
soon. Whichever option is taken, it is understood from now on that, having fixed 
either A or fi, the coefficients (3 depend on A. Clearly, assessing the suitability 
of a particular A by looking at either R{f3) or R\((3) is impossible, since both 
measures are optimal for A = 0. A less optimistic measure of performance is 
the leave-one-out score, which is defined as follows: Let be the stationary 
point of -jyi-j- J2k^i SidizkP*), Vk) — f|/3| 2 , that is, we remove the i-th point 



G 



from the training set. Having computed (3% for every i — 1 . . . N, we form the 
leave-one-out output gi := g{xi(3l) and finally the leave-one-out score 



I?,, 



(21) 



which is then investigated as a function of A (or equivalently /i) . The leave-one- 
out score evaluates every (3% on precisely that sample point which was removed 
from the training set before finding (3$ . The prospect of having to determine the 
coefficients N times in order to compute the leave-one-out score for a single A 
seems horrible at first glance, but a few calculations will help to simplify this 
problem drastically. In Appendix [Al it will be shown that approximately 



f]i := x l fi t i =fj + 



1 



1 — WiXiH~ 1 x t , 



x,TT 



with 



H = X*WX+ (N - 1)A, A = 



xfa + 2A/3 







V o o 



(22) 



(23) 
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All quantities that carry a hat " are evaluated at (3. The right hand side of 
Equation (|22p is a function of (3 and A and therefore can be calculated without 
having to determine all N leave-one-out coefficients explicitely. Furthermore, 
to compute rji, only very few operations are required repeatedly for every i. 
The matrix inversion H" 1 needs to be performed only once. In fact, if the 
Newton-Raphson method is used, all quantities can be recycled. 

The leave-one-out error will in general be larger than R(f3). The difference 
allows for a very interesting interpretation in terms of effective degrees of free- 
dom of the model. We now proceed assuming that the Ignorance has been used 
as a score. Define 8 by 

R loo (\) = R0) + ± (24) 



It is possible to show ( Stonel . Il977h that for a model with free parameters, 
S asymptotically equals the dimension of the parameter space. Using similar 
calculations in the present case, it is possible to show that S = d + 1 — 0(A) 
for small A and 5 — ► 1 for large A, where d is the number of parameters in the 
model (not counting the intercept). If we interprete S as the number of effective 
degrees of freedom of the model, we obtain the reassuring conclusion that for 
vanishing penalty the model has d + 1 effective degrees of freedom, while for 
increasing penalty the number of effective degrees of freedom reduces to 1, owing 
to the fact that no penalty was imposed on the intercept /3p. Equation (12411 is 



a version of Akaike's information criterion (AIC, see e.g. lHastie et al 



Akaike recommends that if models are indexed by a parameter A, say, the model 
with minimum 

AIC:=2i?(/3) + 2^. (25) 

should be selected, which we see is asymptotically equivalent to minimising 
R loa . Although AIC and leave-one-out error are asymptotically the same, the 
two quantities can differ somewhat for very small sample sizes. If the degrees 
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of freedom of a model are known for some reason, it is possible to use the 
AIC directly as a criterion for determining the regularisation parameter. We 
will however go the other way and, knowing _R loo and R((3), determine S for 
diagnostic purposes. 

There is a corresponding relation for the Brier score, rel ating R^ nn , R ( 0) an d 



the degrees of freedom. This statistic, known as C p statisti dHastie et al. I (l200lh . 
is given by 



C p :=R0) + 2-E[g(l-g)]. (26) 



6 
~N~ 

The derivation of C p assumes that g(l — g) is approximately constant. This 
might be justified in many linear regression situations or if g has a sharply 
concentrated distribution, but in general this seems to be a quite idealistic as- 
sumption. A direct calculation though of the leave-one-out error Equation (j2"Tj) 
with the approximation ()22f) , both valid for any score, does not suffer from these 
problems and were found here to give much better results. 

As said, the AIC or the C p -statistic might still be a last resort if calculating 
the leave-one-out coefficients f3{ is difficult or impossible. It is then necessary 
to obtain 6, the number of effective degrees of freedom, by other means. This 
can require tricky analysis. The next subsection discusses an interesting mod- 
ification of the current setup for which the number of degrees of freedom are 
fortunately known. 

3.2 Li— Type Regularisation, or the Lasso 

In the previous subsection, we regularised our estimates b y constrain i ng the 
size of /?, measured in the L2~sense. For linear regression, iTibshiranil (|l996f ) 



suggested to use the Li-norm as an alternative, that is, the score is minimised 
under the constraint \/3\ < n, where \ j3\ := X)fc=i \Pk\- As before, no constraint 
is placed on the intercept. The resulting regression technique has become known 
as the lasso. An interesting feature of the lasso though is that with increasingly 
tight constraining, some coefficients become exactly zero. Interpreting the corre- 
sponding inputs as "less important" , the lasso technique is a ppealing also from 
a diagnostic point of view. Recently, it has been shown by IZou et all (|2007f ) 



that, consistent with intuition, the number of degrees of freedom of the lasso is 
given by the number of nonzero coefficients. 

The main features of the lasso persist when logistic models are used with 
an Li-penalty on the coefficients, that is, with increasingly tight constraining, 
some of the coefficients vanish exactly. The name "lasso" will be kept also for the 
logistic case, even though it was originally used for the linear case only. Calcu- 
lating the coefficients for the lasso is more involved than for standard regression 
or L2 _ regularised regression and requires quadratic optimisation techniques. We 
have not rigorously proven that the number of degrees of freedom in the logistic 
case is still given by the number of nonzero coefficients, although this appears to 
be quite plausible. Hence we suggest to determine the regularisation parameter 
by minimising 

AIC = 2i?(/3) + 2^, (27) 

where 5 is the number of nonzero coefficients, R is the empirical Ignorance score, 
and (3 is the coefficient which optimises R{(3) under the constraint \(3\ < fi or 
equivalently R{j3) + X\(3\. 
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4 Numerical Experiments 

In this section, regularised logistic regression is applied to the occurrence of 
precipitation. Several weather stations in Western Europe were investigated, 
with similar findings. As a representative example, results for Heligoland, Ger- 
many (WMO 10015) arc presented here. As inputs, high resolution deterministic 
and ensemble forecasts were used. The ensemble forecasts consist of the 50 (per- 
turbed) member ensemble, produced by the then -operational ECMWF global 
ensemble prediction system. Station data of precipitation was kindly provided 
by ECMWF as well. Forecasts were available for the years 2001-2005, featuring 
lead times from one to ten days. All data verified at noon. 

To form the inputs, high resolution, control and ensemble forecasts for 
mean sea level pressure, two metre temperature, and precipitation itself were 
used. Different combinations of inputs were tested. To describe the input com- 
binations, we will use the following abbreviations: We write prep, mslp, and 
t2m for precipitation, mean sea level pressure, and two metre temperature, re- 
spectively. The high resolution forecast is indicated with a suffix Ji, while the 
control and the ensemble carry the suffixes _c and _e, respectively. For example, 
the high resolution mean sea level pressure forecast is denoted by mslp_h, or 
the ensemble two metre temperature forecast by t2m_e. The input season is 
simply the phase of the year, that is, 



In total, four different combinations were investigated (see TableQJ. All com- 
binations include season. The first combination adds prcpji, resulting in 3 in- 
puts. The second combination uses all available high resolution forecasts twice: 
plain and squared (8 inputs), thereby modelling potential nonlinear connections 
between precipitation events and the inputs. The third combination uses all 
available precipitation forecasts: prcp_h, prcp_c, and prcp_e (54 inputs). The 
fourth combination uses all available forecasts: high resolution, control and en- 
semble for precip, pressure and temperature. Again, each forecast is included 
both plain and squared. This combination comprises 314 inputs. I should say 
that, to the best of my knowledge (although I cannot claim to have combed 
the literature very thoroughly), so far only the ensemble mean and spread have 
been used as inputs to logistic regression. This might have been done either to 
avoid over-fitting or in the belief that the ensemble does not contain relevant 
information beyond mean and spread. 

The results for the four combinations are displayed in Figures HHH which 
show the empirical score (top panels) and the number of effective degrees of 
freedom (bottom panels), as defined in Equation (|24p . The performance is pre- 
sented in an incremental fashion: Figure [TJ top panel, shows the performance 
of combination I relative to climatology, Figure [2j top panel, shows the perfor- 
mance of combination II relative to combination I and so forth. The confidence 
bars for all plots were obtained using 10-fold cross validation. Figure [TJ top 
panel, demonstrates the interesting (albeit maybe not surprising) fact that the 
high resolution forecast for precipitation contains a fair amount of probabilistic 
information, if processed correctly. The forecast even seems to have skill out to 
day 9. Since this model is trained on around 1600 samples but has only four 
parameters, it is not surprising that the number of effective degrees of freedom 




n = no. of the day (28) 
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S is between 4 and 5. The somewhat odd finding that S is even larger than 
the total number of parameters has two reasons. Firstly, we are dealing with a 
finite data set, while the result is true only in the limit of an infinite amount of 
data. Secondly, due to the penalisation, our parameters are not asymptotically 
unbiased. Strictly speaking, this entails a further correction to Equation (1241) . 
which we found to be less than .1 in all considered examples. 

Adding high resolution forecasts for mean sea level pressure and temperature 
to the mix generally adds skill, as Figure top panel, demonstrates. There 
seems to be no effect though at high lead times, like 9 days. We will see later 
that there is strong indication that the squared mean sea level pressure adds 
additional information, more specifically the high resolution mslp_h 2 at short 
lead times and the ensemble mslp_e 2 at longer lead times. What we can see 
already from Figure bottom panel, though is that the number of EDF's 
increases, in particular for lead times 48h and 96h, where this combination 
features the largest increase in skill over combination I. The number of effective 
degrees of freedom S is however always significantly smaller than 9, the number 
of coefficients in this model. This indicates that not all of the additional inputs 
add independent information. 

Somewhat surprisingly, the precipitation ensemble (along with prcpJi and 
prcp_c) shows close to no improvement in skill over combination I in this con- 
text, apart from lead time 48h maybe (Fig. [3j top panel). The number of 
effective degrees of freedom S is always far less than the number of coefficients 
in this model, and from lead times 96h onwards, S is even comparable to what 
was found for combination II. 

Significant increase in skill is obtained using combination VI (Fig. 0J top 
panel). In addition to combination III, the present combination uses pressure 
and temperature as well as all variables once plain and once squared. It would of 
course be interesting to know which of the inputs makes the biggest difference. 
We have not investigated this in full, but a partial answer will be obtained 
later using the lasso technique. What is important here is that despite the 
large number of coefficients (314), there appear to be no signs of over- fitting. 
Figure |4j bottom panel, demonstrates that the number of effective degrees of 
freedom S, albeit always far less than the number of coefficients in this model, 
is significantly larger than for any other combinations tested here. We can 
conclude that the additional inputs in fact do add additional information. An 
interesting aside is that 5 always seems to have a maximum around 96h lead 
time. We have not investigated the reason, but speculate that this is related 
to the ensemble generation. The ensembles are free runs of slightly perturbed 
initial conditions. The spread of the ensemble typically grows initially, due to 
the local instabilities. The growth of uncertainty is of course the reason why 
ensembles are used in the first place. Eventually, the spread will saturate due 
to nonlinear effects. The middle ground is where each ensemble member adds 
the most information to the whole. 

We are now going to discuss some sample results for the lasso. The goal 
is to get some idea as to the relative importance of the inputs. As discussed, 
an interesting feature of the lasso is that with increasingly tight bound on the 
coefficients (i.e. decreasing /i), some coefficients become zero. The numerical 
experiments discussed here readily demonstrate this effect. As an example, 
forecasts for lead time 48h were considered. As inputs the variables prep, mslp, 
t2m, were used, along with season. To get the big picture, we dispensed with 
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Inputs 

season, prcpji 

season, prcpji, prcp_h 2 , mslpji, mslpji 2 
t2m_h, t2m_h 2 

season, prcpji, prcp_c, prcp_e 

season, *_h, *_c, *_e, *Ji 2 , *_c 2 , *_e 2 
with * = prep, mslp, t2m 



No. of Coeffz. 



I 



3 



II 



8 



III 



54 



IV 



314 



Table 1: Tested combination of inputs. The number of coefficients does not 
include the intercept. 

including all available types of forecasts but concentrated on the high resolution 
forecasts, the high resolution forecast squared, the ensemble mean, the ensemble 
mean squared, and the ensemble variance^]. In fact, there was enough data 
available so that in this case, the coefficients could have been determined without 
regularisation. In order to see regularisation "in action", we reduced the data 
set to only 6 months. 

The coefficients as a function of the bound fi are displayed in Figures EHE1 
In order to avoid clutter, the coefficients corresponding to different variables 
are displayed in different figures. All plots show the AIC for reference (dashed 
grey line and right ordinate). The AIC shows a minimum at around /i = 4, so 
this would be the bound chosen by the AIC criterion. At this point, there are 
10 nonzero coefficients. In terms of importance of individual inputs, the mag- 
nitude of the coefficients is probably not as descriptive as the point at which 
a given coefficient drops out of the mix with increasingly tight bound. The 
longest-surviving inputs are (in descending order) the prep ensemble mean, the 
mslp high resolution squared and the mslp high resolution. The prep high res- 
olution seems to be almost as important. t2m gets coefficients of comparable 
magnitude, but drops out earlier. The behaviour of prep ensemble variance is 
similar. The most interesting observation is probably the contribution of the 
mslp high resolution forecast to the model, displayed in Figure [5J Reassuringly, 
we obtain that the probability of rain decreases with both increasing and de- 
creasing pressure, with a maximum at around 1012hPa. Of course, this analysis 
neglects the influence of the other inputs, which are certainly not independent 
of pressure. In addition, Figure [7] shows that the contribution of the mslp en- 
semble mean is very similar to that of the mslp high resolution forecast, albeit 
with smaller magnitude. It turns out that the importance of the mslp ensemble 
mean increases with increasing lead time (not shown) , with high resolution and 
ensemble mean effectively reversing role at around lead time 120h. 

5 Conclusion 

Logistic models were considered as a statistical tool to map both deterministic 
and ensemble weather forecasts into probabilities. The parameters of the logistic 
model were found by optimising proper scoring rules. Due to the variety of 

1 Alternatively, one could take the the mean of the squared ensemble members, but choosing 
the variance makes the model's dependence on the ensemble spread explicit. 
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Figure 1: Empirical Ignorance score (top panel) and effective degrees of free- 
dom d (bottom panel) for input combination I (see table [l}. The reference here 
is climatology. The confidence bars were obtained using tenfold cross validation. 
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Figure 2: Empirical Ignorance score (top panel) and effective degrees of free- 
dom S (bottom panel) for input combination II (see table [T]). The score is 
presented with combination I as a reference. The confidence bars were obtained 
using tenfold cross validation. 
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Figure 3: Empirical Ignorance score (top panel) and effective degrees of free- 
dom 6 (bottom panel) for input combination III (see table [T|) . The score is 
presented with combination II as a reference. The confidence bars were ob- 
tained using tenfold cross validation. 
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0.02 



Combination IV 




240 



240 



Lead Time (hours) 

Figure 4: Empirical Ignorance score (top panel) and effective degrees of free- 
dom 5 (bottom panel) for input combination IV (see table [JJ . The score is 
presented with combination III as a reference. The confidence bars were ob- 
tained using tenfold cross validation. 
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Figure 5: The coefficients corresponding to the input season for a lasso model, 
plot over the regularisation parameter fi. Here, □ = cos(. . .) and = sin(. . .) 
of Equation [2E 




Figure 6: The coefficients corresponding to the inputs related to prep (pre- 
cipitation) for a lasso model, plot over the regularisation parameter fi. Here, 
□ = High Resolution Forecast, = (High Resolution Fc.) 2 , A = Ensemble 
Mean, y = (Ensemble Mean) 2 , Q = Ensemble variance. 
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Figure 7: The coefficients corresponding to the inputs related to mslp (mean 
sea level pressure) for a lasso model, plot over the regularisation parameter /i. 
For explanation of line markers see Figure [6] 




Figure 8: The coefficients corresponding to the inputs related to t2m (tempera- 
ture) for a lasso model, plot over the regularisation parameter \i. For explanation 
of line markers see Figure [51 



17 



0.5 




-2' 1 1 1 1 1 

990 1000 1010 1020 1030 1040 

Pressure 



Figure 9: Contribution of the mslp (pressure) high resolution forecast to the 
lasso model, more specifically, to the logarithmic odds ratio. There is a signifi- 
cant contribution from the term quadratic in mslp. 



different forecast products that are typically available for a single target, the 
number of potential inputs to the logistic model might be large, comparable even 
to the number of forecast instances. This number is increased even further if, 
next to the original inputs themselves, some nonlinear functions of them are to 
be included (for example squared terms or bilinear combinations of two inputs). 
This renders a reliable determination of the model coefficients difficult due to 
over-fitting. This paper presents systematical approaches to avoid over-fitting 
in logistic models. Motivated by similar approaches for linear regression, the 
coefficients are penalised, thereby reducing their potential variance. Techniques 
for model selection using efficient leave-one-out cross validation are discussed. 
As an example, probabilities of precipitation are considered. Not only forecasts 
for precipitation, but also pressure and temperature are used (both as high 
resolution as well as ensemble forecasts) . It is demonstrated that these forecasts 
do add information to the pure precipitation forecasts. Despite the large number 
of inputs, it is demonstrated that the parameters can be determined in a reliable 
fashion. A variant of the logistic model called the lasso was discussed. In this 
model, constraining the coefficients can result in some of them being exactly 
zero, allowing to identify them as less important. A notable observation here is 
the large contribution of the mslp high resolution forecast itself and its square 
for small lead times, and similarly of the mslp ensemble forecast for larger lead 
times. 
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A Leave— one— out parameters 

In this appendix, I want to demonstrate that Equation [22] is a good approxi- 
mation for the linear response to the input Xi of the model with parameters 
that is, which are based on a training set without the i-th input-target pair. I 
will use the following abbreviations (consistent with the notation elsewhere in 
the paper): 

77, := Xj/3* for all £ = 1 ... iV (29) 
9i : = g ( m ) for all £= 1...N (30) 
1 N 

R W : = jyE^y*) (3i) 
fc=i 

Rxffl := R(f3) + \\(3\ 2 , (32) 

The reader be again reminded that the penalty term |/3| 2 = J2k=i@k does 
not involve the intercept /3o- Let (5 be a stationary point of R\((3), that is 
dR\($)/dj3 = 0. In analogy to the definitions (|2"9"1) and (|3T)|) . I set rji :— Xi$*, 
cji := g(rji) for all £ = 1 . . . N. Let (3% be a stationary point of R\((3) but omitting 
the £— th input -target pair, or more specifically, 



o = a 



N 



(33) 



Now from the definition of R\ we get 

N 

NR x (f3) = J2 S(g kl y k )+NX\(3\ 2 = ]T S(g k , y k )+(N-l)X\(3\ 2 +X\l3\ 2 +S( gi , Vl ). 

k=l k=£i 

(34) 

We now take the derivative of this at f3i and divide by N . Using Equation |33|) 
we obtain 

dpRxifii) = jjdp [S{g hVi ) + \\f3\ 2 ]^ t . (35) 
On the other hand, we can expand dpR\(j3) to first order in — p at (3 

9pR\ (Pi) e df,R x (fi) +tfpR\0) (pt - $)* , (36) 
=o 
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where the first term vanishes because (3 is a stationary point of R\ ((3) . The two 
relations (|35[) and (|36[) are central to the argumentation in this appendix. 

We continue with determining /3$. To this end, we also linearise the first 
equation (|3"5|) . set it equal to (f3"6")l . and solve for j3% — $. This yields 



Pt-^ = HT l [dfsSiguVif + 2A/3* 
with the following abbreviations: 



(37) 



A := 



H, 



/ o 


° \ 




A 







V 


A / 


= H 


- dj 2 S(gi,yi) 



(38) 



(39) 
(40) 



H = Ndf J2 Rx(P)-2A 

We will now calculate the derivatives of S in Equation (|3T|) . Firstly, we have 

d/3S(gi,i/i) = 9 r ,S , (g(^),y i )x i = d i x l (41) 

where di = d v S(g(f/i), yi) is used as a shorthand (consistent with Equation . 
Secondly, we have 

9p2S(gi, = <Nix\xi (42) 

where = d^ 2 S(g(fii), yi) is used as a shorthand (consistent with Equation [TTJ) . 
With W fci = S k iw k and X ki = xj^ we get 



H = X'WX + 2(N - 1)A, Hj = H - w.x*^ 



(43) 



It is not necessary to invert the full matrix for every i. Once the inverse of H is 
known , applying the Sherman-Morrison formula (see e.g. iGolub and Van Loan 
(|l996l )) gives 

H7 1 = H _1 + . ; . H-^H- 1 . (44) 



Consequently, 



iH7 l = 



(45) 



Since r\% — fji = Xif3~ — Xi$*, we can employ Equation ([37]) together with (jITj) 
and (l4"5"j) to obtain 



m = v- 



1 — w,x 



H ijcj 



x*d 4 + 2A/3* 



(46) 



Note that as soon as H _1 has been computed, very few additional operations 
are required for every i. The matrix inversion H _1 needs to be performed only 
once. 
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